large-scale problem on a huge data set [53]. We propose to solve the problem within the backpropagation framework based on two considerations: 1) the inference of the optimized model operates on the quantized variables, which means that the variables must be quantized in the forward pass (corresponding to inference) during training, and the loss is calculated from the quantized variables; 2) the backpropagation does not need to be quantized, but it must fully account for the relationship between the quantized variables and their counterparts. Based on these considerations, we propose that in the $k$th iteration, based on the projection in Eq. 3.29, $x^{[k]}$ is quantized to $\hat{x}^{[k]}$ in the forward pass as
based on the projection in Eq. 3.29, x[k] is quantized to ˆx[k] in the forward pass as
ˆx[k] = PΩ(ω, x[k]),
(3.30)
which is used to improve the backpropagation process by defining an objective as
$$\min\; f(\omega, x) \quad \text{s.t.} \quad \hat{x}^{[k]}_{j} = P^{j}_{\Omega}(\omega_j, x), \tag{3.31}$$
where $\omega_j$, $j \in \{1, \dots, J\}$, is the $j$th projection matrix$^{3}$, and $J$ is the total number of projection matrices. To solve the problem in (3.31), we define our update rule as
$$x \leftarrow x^{[k]} - \eta\,\delta^{[k]}_{\hat{x}}, \tag{3.32}$$
where the superscript $[k+1]$ is removed from $x$, $\delta_{\hat{x}}$ is the gradient of $f(\omega, x)$ with respect to $x = \hat{x}$, and $\eta$ is the learning rate. The quantization process $\hat{x}^{[k]} \leftarrow x^{[k]}$, that is, $P^{j}_{\Omega}(\omega_j, x^{[k]})$, is equivalent to finding the projection of $\omega_j \circ (x + \eta\delta^{[k]}_{\hat{x}})$ onto $\Omega$ as
$$\hat{x}^{[k]} = \arg\min_{\hat{x}} \left\{ \|\hat{x} - \omega_j \circ (x + \eta\delta^{[k]}_{\hat{x}})\|^2,\; \hat{x} \in \Omega \right\}. \tag{3.33}$$
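To make the projection concrete, the following is a minimal sketch of the nearest-point search in (3.33); it assumes element-wise quantization onto a binary set $\Omega = \{-1, +1\}$ and an element-wise scaling matrix $\omega_j$, and all function and variable names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def project_onto_omega(v, levels=(-1.0, 1.0)):
    """Element-wise nearest-point projection of v onto a discrete set Omega.

    Sketch of Eq. (3.33): each entry of v is mapped to the level in Omega
    that minimizes the squared distance. Omega = {-1, +1} is an assumption
    made for illustration; other quantization levels work the same way.
    """
    levels = np.asarray(levels)
    dists = (v[..., None] - levels) ** 2      # distance of each entry to every level
    return levels[np.argmin(dists, axis=-1)]  # pick the closest level per entry

def quantize(omega_j, x, eta, delta_xhat):
    """x_hat = argmin_{x_hat in Omega} ||x_hat - omega_j ∘ (x + eta * delta)||^2."""
    v = omega_j * (x + eta * delta_xhat)      # ∘ denotes the element-wise product
    return project_onto_omega(v)
```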
Obviously, $\hat{x}^{[k]}$ is the solution to the problem in (3.33). So, by incorporating (3.33) into $f(\omega, x)$, we obtain a new formulation for (3.31) based on the Lagrangian method as
$$\min\; f(\omega, x) + \frac{\lambda}{2} \sum_{j=1}^{J} \|\hat{x}^{[k]} - \omega_j \circ (x + \eta\delta^{[k]}_{\hat{x}})\|^2. \tag{3.34}$$
The newly added term (the second term) in (3.34) is a quadratic function and is referred to as the projection loss.
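As a rough illustration of how the projection loss could be evaluated during training, the sketch below computes the penalty term of (3.34) with plain NumPy; `task_loss`, the variable names, and the use of $J$ separate scaling matrices stored in `omegas` are hypothetical choices for exposition, not the chapter's actual implementation.

```python
import numpy as np

def projection_loss(x, x_hats, omegas, eta, delta_xhat, lam):
    """(lam / 2) * sum_j ||x_hat_j - omega_j ∘ (x + eta * delta)||^2, cf. Eq. (3.34)."""
    target = x + eta * delta_xhat
    total = 0.0
    for x_hat_j, omega_j in zip(x_hats, omegas):
        diff = x_hat_j - omega_j * target
        total += np.sum(diff ** 2)
    return 0.5 * lam * total

# Hypothetical use inside one training iteration:
#   1. quantize x with each omega_j in the forward pass (Eqs. 3.30 and 3.33),
#   2. total = task_loss(...) + projection_loss(x, x_hats, omegas, eta, delta_xhat, lam),
#   3. update x by gradient descent on `total`, cf. the update rule in (3.32).
```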
3.5.3 Theoretical Analysis
Taking a closer look at the projection loss in Eq. 3.34, we have
$$\hat{x}^{[k]} - \omega \circ (x + \eta\delta^{[k]}_{\hat{x}}) = \hat{x}^{[k]} - \omega \circ x - \omega \circ \eta\delta^{[k]}_{\hat{x}}. \tag{3.35}$$
In this case, we consider only one projection function, so the subscript $j$ of $\omega_j$ is omitted for simplicity; the analysis for multiple projections is given afterward. In the forward step, only the discrete kernel values participate in the calculation, so their gradients can be obtained by
$$\frac{\partial f(\omega, \hat{x}^{[k]})}{\partial \hat{x}^{[k]}} = \omega \circ \delta^{[k]}_{\hat{x}}, \tag{3.36}$$
$^{3}$Since the kernel parameters $x$ are represented as a matrix, $\omega_j$ denotes a matrix, as does $\omega$.